There is a lot of junk science about what makes a good wine. This analysis will use data to draw conclusions about what really makes a good red wine.
A quick look at a boxplot matrix of our data shows that our data is generally skewed right. We can also see that some of our variables, like chlorides and residual sugar, have very low spread and many outliers. Viewing the data in this way also gives us a sense of the scale of each variable.
The quality of the wines has a fairly normal distribution. The average wine has a quality of 5.64. Later, we will perform more advanced analysis to find out what factors contribute to high quality.
The alcohol content of red wine is skewed right; the average bottle of red wine is 10.42% alcohol but it can be as high as 14.9%.
The sulphate content of red wine is also skewed heavily right. The average bottle of red wine has a sulphate level of 0.66 but the highest level is 2.
The pH levels for red wine are normally distributed with an average pH level of 3.31.
The density of red wine is also normally distributed with an average density of 0.97.
Total sulfur dioxide for red wine is skewed right with an average of 46.47.
Free sulfur dioxide is similarly shaped to total sulfur dioxide, but with a much lower average of 15.87.
The data for chlorides in red wine is heavily skewed right and has a very long tail. The long tail pulls the mean chloride level (0.09) above the median (0.08). The maximum value here is a massive 0.61.
The data on residual sugar is very similar to the data on chlorides. The residual sugars in red wine have an average of 2.54 and a max of 15.5.
Citric acid levels in red wine are skewed right with an average level of 0.27.
The volatile acidity of red wine is roughly normally distributed with a slight right skew. The average here is 0.53.
And finally, the fixed acidity of red wine is skewed right with an average value of 8.32.
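A quick way to double-check the direction of a skew is to compare the mean with the median, or to look at the sign of the sample skewness. The sketch below is only an illustration: `skew_direction` is a hypothetical helper and `chlorides_like` is made-up data, not the wine dataset.

```r
# Hypothetical helper: report skew direction from the sign of the
# moment-based sample skewness (positive = long right tail).
skew_direction <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  g1 <- mean(centered^3) / mean(centered^2)^1.5
  if (g1 > 0) "right" else if (g1 < 0) "left" else "symmetric"
}

# Made-up values shaped like a variable with a few large outliers
# (an assumption for illustration, NOT the real chloride data).
chlorides_like <- c(rep(0.08, 90), rep(0.10, 8), 0.40, 0.61)

mean(chlorides_like)            # pulled upward by the two large values
median(chlorides_like)          # unaffected by the tail
skew_direction(chlorides_like)  # direction of the long tail
```

When the mean sits above the median and the skewness is positive, the long tail is on the right.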
There are 12 variables recorded for each of 1599 observations of red wine. The data is tidy; I came across no missing values while calculating averages.
All of the variables are objectively measurable with the exception of quality. This allows us to compare the chemical properties against the quality rating to deduce the properties of desirable wine.
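A check for missing values along these lines can be sketched in a few lines of base R. The `toy` data frame below is a made-up stand-in (an assumption, not the wine data) with one deliberately missing value, so that the checks have something to find.

```r
# Made-up stand-in data with one missing value (not the wine dataset).
toy <- data.frame(
  alcohol = c(9.4, 9.8, NA),
  pH      = c(3.51, 3.20, 3.26)
)

dim(toy)             # number of observations and variables
colSums(is.na(toy))  # count of missing values per column
anyNA(toy)           # TRUE if anything at all is missing

# mean() returns NA when a column has missing values, which is how
# missing data tends to surface while calculating averages.
mean(toy$alcohol)
mean(toy$alcohol, na.rm = TRUE)
```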
I did not create any new variables. The data set has a good number of variables already, and creating more would require an advanced investigation into wine making that goes outside the scope of what I am trying to achieve.
Most of the data has some pretty clear outliers. I trimmed the highest 1% of values from most of the variables, bringing the number of observations down from 1599 to 1470. I also dropped the x column as it only served to uniquely identify each bottle of wine, which is not important in this investigation.
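Trimming of this kind can be sketched as follows. This is a minimal illustration under assumptions, not the exact code used for this analysis: `trim_top` is a hypothetical helper, the `toy` data frame is simulated, and the choice of columns to trim is arbitrary.

```r
# Hypothetical helper: keep only rows where every listed column is at or
# below its p-th quantile (p = 0.99 drops roughly the top 1% per column).
trim_top <- function(df, cols, p = 0.99) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    cutoff <- quantile(df[[col]], probs = p, na.rm = TRUE)
    keep <- keep & !is.na(df[[col]]) & df[[col]] <= cutoff
  }
  df[keep, , drop = FALSE]
}

# Simulated right-skewed data with an identifier column (not the wine data).
set.seed(1)
toy <- data.frame(
  X              = 1:1000,
  chlorides      = rexp(1000, rate = 10),
  residual.sugar = rexp(1000, rate = 0.4)
)

toy$X <- NULL  # drop the identifier column; it carries no information
trimmed <- trim_top(toy, c("chlorides", "residual.sugar"))

nrow(toy)      # rows before trimming
nrow(trimmed)  # rows after trimming
```

Because each variable is trimmed separately, the total number of rows removed can be well above 1% of the data.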
First, we’ll look at a paired plot of all the data. There really isn’t anything here that we would call a strong correlation, although there are a few moderate ones.
Alcohol has the highest correlation with quality at 0.49. We can see here that there is a relationship between them, although we would be hard-pressed to call a correlation of 0.49 strong. The correlations with volatile acidity and sulphates are large enough to be noticed in this dataset, but can only really be called weak at best.
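Ranking variables by their correlation with quality can be sketched like this. The data is simulated and the coefficients used to generate it are assumptions chosen only to give moderate correlations; none of the numbers come from the wine dataset.

```r
# Simulated stand-in data (not the wine dataset).
set.seed(42)
n <- 500
alcohol <- rnorm(n, mean = 10.4, sd = 1)
toy <- data.frame(
  alcohol = alcohol,
  density = 1 - 0.001 * alcohol + rnorm(n, sd = 0.002),
  quality = 0.4 * alcohol + rnorm(n, sd = 0.7)
)

# Pearson correlation of every variable with quality.
cors <- cor(toy)[, "quality"]
cors <- cors[names(cors) != "quality"]  # drop quality's correlation with itself

# Rank by absolute value so strong negative correlations are not overlooked.
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```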
Since finding a highly correlated relationship to quality seems to be a bust, we should take the time to examine other relationships in the dataset. We’ll start with alcohol since it has the strongest positive correlation with quality.
Alcohol only has one other interesting correlation. A high alcohol content is associated with a lower density.
Before we end this stage of our investigation, we’ll take a look at a variable with several interesting correlations:
These relationships are almost high enough to be considered strong, but we will err on the side of caution and say they are moderate. Of most interest to me is the fact that the relationship between fixed acidity and pH is not stronger. It makes sense that the relationship is negative, since a lower pH value means higher acidity, but since pH is a measure of acidity I would have expected a stronger relationship.
I had assumed that there would be some features of wine that would stand out as being indicative of quality. Interestingly, there were no strong relationships, and the only moderate correlation to quality was alcohol content.
The most notable relationships were those of fixed acidity with citric acid, density, and pH, and those of citric acid with fixed acidity, volatile acidity, and pH. Together, they form a cluster of relationships that seem interesting for future investigation.
The strongest relationship was between pH level and fixed acidity with a correlation of -0.69. This makes sense, since pH is a measure of acidity, and the lower the pH, the higher the acidity.
For this stage of our investigation, I would like to create some advanced plots that might show some potentially hidden relationships with quality. We’ll start with a plot of alcohol, density, and quality since this is the strongest chain of relationships we have to quality.
We can see that the groupings by quality trend toward lower density and higher alcohol levels. It is interesting to note that red wines at the bottom of the quality spectrum have the most atypical grouping.
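The trend across quality groups can also be summarized numerically by averaging alcohol and density within each quality score. The sketch below uses simulated data with assumed coefficients, so the numbers it produces say nothing about the real wines; it only shows the shape of the calculation.

```r
# Simulated stand-in data (not the wine dataset).
set.seed(3)
n <- 600
toy <- data.frame(quality = sample(3:8, n, replace = TRUE))
toy$alcohol <- 8 + 0.4 * toy$quality + rnorm(n, sd = 0.8)
toy$density <- 1 - 0.001 * toy$alcohol + rnorm(n, sd = 0.001)

# Mean alcohol and density for each quality score.
group_means <- aggregate(cbind(alcohol, density) ~ quality, data = toy, FUN = mean)
group_means
```

Reading down the rows shows whether the group averages drift toward higher alcohol and lower density as quality rises.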
I would also like to create a regression model to show the relationship between quality, alcohol, and density:
## (Intercept)     alcohol     density
## -29.0090120   0.4109539  30.4704353
##
## Call:
## lm(formula = quality ~ alcohol + density, data = df2)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6438 -0.3883 -0.1462  0.4978  2.5509
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.00901   11.97987  -2.421   0.0156 *
## alcohol       0.41095    0.02048  20.070   <2e-16 ***
## density      30.47044   11.91075   2.558   0.0106 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.684 on 1467 degrees of freedom
## Multiple R-squared: 0.246, Adjusted R-squared: 0.245
## F-statistic: 239.3 on 2 and 1467 DF, p-value: < 2.2e-16
##
## Welch Two Sample t-test
##
## data: df2$quality and c(df2$alcohol, df2$density)
## t = -0.71079, df = 3250.1, p-value = 0.2386
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.08421926
## sample estimates:
## mean of x mean of y
##  5.636054  5.700110
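This is very interesting. In the linear model, alcohol has a p-value below 2e-16 and density has a p-value of 0.0106, so both are significant at the 0.05 level, although the model as a whole explains only about 25% of the variance in quality (R-squared of 0.246). Accordingly, we reject the null hypothesis that these coefficients are zero and conclude that alcohol and density are significant factors in red wine quality. The Welch t-test's p-value of 0.24 does not bear on this question: it compares the mean quality score with the mean of the pooled alcohol and density values, and a p-value that large would mean failing to reject the null hypothesis, not rejecting it. Why are these factors significant when their correlations with quality are only modest? We’ll come back to this issue more in the summary.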
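For readers who want to reproduce this kind of fit, a minimal sketch follows. It uses simulated data (the `toy` data frame and the coefficients used to generate it are assumptions, not the wine data) and shows where `summary()` reports the per-coefficient p-values and the R-squared for an `lm` fit.

```r
# Simulated stand-in data (not the wine dataset).
set.seed(7)
n <- 300
toy <- data.frame(alcohol = rnorm(n, mean = 10.4, sd = 1))
toy$density <- 1 - 0.001 * toy$alcohol + rnorm(n, sd = 0.002)
toy$quality <- 0.4 * toy$alcohol + rnorm(n, sd = 0.7)

# Regress quality on alcohol and density.
fit <- lm(quality ~ alcohol + density, data = toy)

# The coefficient table holds one row per term; the "Pr(>|t|)" column is
# the p-value for the null hypothesis that the coefficient is zero.
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]
p_values

# Share of the variance in quality explained by the model.
r_squared <- summary(fit)$r.squared
r_squared
```

The p-values to interpret are the ones attached to each coefficient. A t-test comparing the mean of quality with the means of the predictors answers a different question.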
Since alcohol seems to be the largest factor in quality, let’s take a look at alcohol and quality in relation to some other variables:
We will examine the relationships between these further in the next section.
The multivariate analysis of this dataset was crucial. Looking at only the correlation values for the data, we might conclude that the correlations are too low to show any real relationship. However, our multivariate analysis showed that this was not the case, and fitting the data to a linear model supported this.
Despite alcohol having the strongest correlation to quality (0.49) and density having a weaker correlation (-0.2), our linear model showed that density also has a role to play in quality. When looking at the correlation between alcohol and density (-0.5), we can see that this makes sense.
The strength of my model is that it uses a chain of three variables with reasonably high correlations to show a powerful relationship. It could be made stronger by taking other variables more directly correlated with quality and seeing how powerful their effect is.
The bivariate analysis of quality against alcohol is one of my favorites. Alcohol is far and away the variable most highly correlated with quality and it brings up many questions, enough that we could conduct an entirely new investigation. It is commonly believed that alcohol tastes bad, and the more like alcohol something tastes, the worse we think of it. Many drinks with high alcohol content are diluted to avoid the flavor; in fact, wine is one of the few drinks that is generally not diluted. Do wine drinkers enjoy the taste of alcohol? Are the testers who gave these ratings biased towards highly alcoholic drinks? Does a high alcohol by volume mean that a tester will be in a better mood when it comes time to rate for quality?
These two plots in conjunction tell an interesting story. Higher quality wines tend to not only have more alcohol, but also lower volatile acidity and higher citric acid. This seems to imply that there are different kinds of acidity and they work quite differently. I would want to do more research before drawing conclusions, but it seems as though volatile acidity might indicate a more “vinegary” wine, while higher citric acid indicates more fruit flavors.
This chart is my favorite due to the lesson it teaches about correlation and p-value. At first glance, it looks like low correlations across the board mean that none of our variables significantly impact quality. However, when we create a linear model comparing alcohol and density to quality, we find an entirely different story. Not only does alcohol have a relationship to quality with a p-value of practically 0, but we find that density plays a part as well (p-value of 0.0106) by way of alcohol: density is negatively correlated with alcohol, which is positively correlated with quality.
This analysis was underwhelming at first. I expected to rush in, find some interesting insights, and zoom off and away to the next project. Instead, I discovered over ten variables that did not correlate well with quality, leading me to think that there were no good indicators for what made a good red wine.
I almost didn’t do a linear model as part of this analysis, but I am very glad I did. It was frustrating to find out that the regression model didn’t match up well with my correlations. How could something have such a low p-value but have a correlation of less than 0.5?
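One way to see how a modest correlation can come with a tiny p-value is to work through the standard significance test for a Pearson correlation, where the test statistic t = r * sqrt((n - 2) / (1 - r^2)) grows with the sample size. The sketch below plugs in the correlation of 0.49 and the 1470 trimmed observations from this analysis; `cor_p_value` is a hypothetical helper name, and the sample of 10 is an assumed comparison point.

```r
# Hypothetical helper: two-sided p-value for the null hypothesis that a
# Pearson correlation r, estimated from n observations, is really zero.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

cor_p_value(0.49, 1470)  # same correlation, large sample
cor_p_value(0.49, 10)    # same correlation, very small sample
```

The correlation measures how strong the relationship is, while the p-value measures how confident we can be that it is not zero. With well over a thousand observations, even a moderate correlation is very unlikely to be a fluke.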
When running correlations, you often hear some loose numbers for what makes a good correlation: a good correlation is 0.6, 0.7 or even 0.9 and above. This dataset, for the most part, did not include factors with correlations nearly that high. To make things even stranger, factors that should be obviously correlated did not score high either. What’s the deal with pH and fixed acidity having a correlation of -0.68, but pH and volatile.acidity having a correlation of 0.029?
Using this dataset made me question a lot of beliefs I had about what is and is not relevant or statistically meaningful data. There are ideal values and thresholds in statistics, but not every dataset will have them.
For future analysis, I would like to focus on the relationship between quality and alcohol. Does this hold true for other alcoholic beverages? The craft beer scene is booming right now; are there parallels between the two? What about for other kinds of wine, like plum? How do acid levels affect these drinks?